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Electron diffraction of extremely small three-dimensional crystals (MicroED) 
allows for structure determination from crystals orders of magnitude smaller 
than those used for X-ray crystallography. MicroED patterns, which are 
collected in a transmission electron microscope, were initially not amenable to 
indexing and intensity extraction by standard software, which necessitated the 
development of a suite of programs for data processing. The MicroED suite was 
developed to accomphsh the tasks of unit-cell determination, indexing, 
background subtraction, intensity measurement and merging, resulting in data 
that can be carried forward to molecular replacement and structure 
determination. This ad hoc solution has been modified for more general use 
to provide a means for processing MicroED data until the technique can be fully 
implemented into existing crystallographic software packages. The suite is 
written in Python and the source code is available under a GNU General Public 
License. 



1. Introduction 

We recently presented the 2.9 A resolution structure of hen egg white 
lysozyme determined by electron diffraction of three-dimensional 
microcyrstals 2 x 2 x 0.5 |im in size (Shi et al, 2013). This technique, 
which we called MicroED, allows for structure determination from 
protein crystals that are more than six orders of magnitude smaller 
than those used for X-ray crystallography. In the future this may 
allow for protein structure determination for targets that have so far 
been unattainable. 

Electrons provide an alternative to X-rays for diffraction studies of 
small protein crystals. The larger ratio of elastic to inelastic scattering 
coupled with the significantly smaller amount of energy deposited by 
inelastic electron scattering events (Henderson, 1995) increases the 
quantity of data that can be collected from small crystals before they 
are destroyed by radiation damage. This is tempered by the limited 
penetration of the electron beam compared to X-rays, which has 
generally restricted electron crystallography to very thin two- 
dimensional crystals (Abeyrathne et al, 2011). Previous work to 
collect diffraction data from thin three-dimensional crystals found 
that crystals that were small enough to allow beam penetration were 
destroyed after the collection of only one or two diffraction patterns 
at the usual ^^20 e~ A~^ dose rate (Glaeser, 1971). Therefore 
diffraction patterns had to be collected from multiple crystals, and 
several groups have been developing software for processing misor- 
iented electron diffraction data (Jiang et al, 2009, 2011) An analo- 
gous problem has been encountered and solved in the development 
of X-ray free-electron laser (XFEL) -based microcrystal diffraction 
methods (Chapman et al., 2011; Boutet et al., 2012). 

MicroED data are collected using extremely low electron dose rate 
and under cryogenic conditions in a transmission electron microscope 
from crystals embedded in vitreous ice. A diffraction pattern is 
collected from a static crystal, which is then tilted to collect additional 
patterns at varying angles. Because the electron dose is very low, 
0.01 e~ s~^, multiple diffraction patterns ('still diffraction' tilt 



series) can be collected from a single crystal using a sensitive CMOS- 
based camera before significant radiation damage becomes apparent 
(Shi et al., 2013). Related methods such as electron diffraction 
tomography and the diffraction rotation technique have been 
described before (Kolb et al., 2007, 2008; Zhang et al., 2010). 

The electron diffraction data generated from MicroED are in 
principle no different than those from X-ray diffraction. Our attempts 
to index the diffraction patterns using MOSFLM (Leslie & Powell, 
2007) resulted in errors, apparently due to a combination of the size 
of the Ewald sphere, inaccuracy in the tilt angles reported by the 
microscope compustage, and the fact that the pattern arises from 
static crystals while the program was expecting crystal oscillation. 
Such difficulties in processing electron diffraction data with 
MOSFLM have been reported by others (Nederlof et al, 2013). 
However, some success was reported with MOSFLM when rotation 
electron diffraction was used, after the diffraction patterns had been 
centered and bad pixels removed. Other software, which was written 
specifically for processing electron diffraction data from patterns 
where the relational angle is known or not, is also available and 
includes EDIFF and RED (Jiang et al., 2011; Wan et al., 2013). Such 
software may have been useful in processing MicroED data, but we 
had difficulties in implementing the programs on our systems. 
Therefore, a suite of software was written to process the MicroED 
data for our initial proof of concept experiments, with the eventual 
goal of integration of these techniques into currently existing soft- 
ware such as MOSFLM (Lesfie & Powell, 2007) or CCTBX (Adams 
et al, 2010). 

The MicroED suite contains eight programs that work together to 
accomplish essential data processing tasks: determination of the unit- 
cell size and orientation, spot prediction, indexing, and measuring 
spot intensities: 

Cataspot.py: a graphical user interface (GUI) which allows the user 
to identify spots in diffraction patterns, record their x, y coordinates 
and prepare the data files used by subsequent programs. 
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findjengths: rough determination of unit-cell lengths. 

calcjucvectors: rough determination of the vectors that describe 
the unit cell in reciprocal space. 

spotjndex: indexing of the spots for finer unit-cell vector deter- 
mination. 

refine _spots'. refinement of the spots for unit-cell vector determi- 
nation. 

UCR_index: re-indexing of the images using the refined spots. 

recalculate _vectors'. calculation of more accurate unit-cell vectors 
from the new indexing. 

measure Jntensities: measurement of background-subtracted spot 
intensities. 

The MicroED suite programs are designed to be cross-platform 
with any necessary modules, libraries and/or outside programs 
commonly available. All of the programs are implemented in Python 
2.7 using the standard modules numpy (Oliphant, 2007) and Python 
Image Library (Secret Labs, 2013). The GUI requires Tkinter 
(Python Software Foundation, 2013), also a common python module. 
Two outside programs are required: Gnuplot (Williams & Kelley, 
2011) and ImageMagick (ImageMagick, 2013), both of which are 
freely available. Any additional image processing can be performed 
with FIJI (Schindelin et aL, 2012). 



2. Initial data processing 

The raw images used for data processing must be in tif format. This is 
a standard output option for EM-MENU4 (Ghadimi et aL, 2009), 
which was used for data collection in our initial work (Shi et aL, 2013). 
Other EM data formats can be converted into tif format using 
programs such as em2em (Image Science Software, 2013). In order to 
conserve memory, all of the illustrations drawn by the programs use a 
gif version of each image, which can be produced using FIJI or any 
other image manipulation program, the only requirement being that 
the gif version has the same name as the original file. 



3. Cataspot 

Cataspot's GUI (Fig. 1) allows for batch processing of images, 
prompting the user to select points on the image and then deter- 
mining relevant parameters and writing a data file used by the 
subsequent programs. 

The first procedure performed by Cataspot is the determination of 
the beam center, which is calculated on the basis of one or more 
Freidel pairs selected by the user. The user is also able to specify the 
beamstop center, used later to calculate the beamstop mask. The 
program then allows the user to select additional spots to be used for 
unit-cell determination and reference spots which will be used to 
define the plane of the image for spot prediction. 



4. Unit-cell determination 

The electron diffraction data processing suite ED IFF (Jiang et aL, 
2011) provides a platform for determining unit-cell parameters from 
single electron diffraction patterns obtained from randomly oriented 
crystals. We encountered difficulties in implementing EDIFF on our 
systems, and inputting MicroED data into EDIFF caused unknown 
errors, which precluded us from using this program. 

In our original paper, the structure determination of lysozyme was 
started with a priori knowledge of unit-cell dimensions and angles for 
the crystal (Cipriani et aL, 2012). Therefore a simplified unit-cell 



determination procedure was used only to verify the expected 
dimensions and angles. 

De novo unit-cell determination using updated programs is now 
possible but will require significant trial and error on the part of the 
user. Techniques such as the Rossman Fourier analysis method used 
by MOSFLM (Powell, 1999; Steller et aL, 1997) or the 'facet 
matching' method of EDIFF would be much more effective for 
determination of an unknown unit cell. It is recommended to use the 
MicroED suite unit-cell determination programs as a last resort and 
to independently verify unit-cell dimensions using an outside 
program. 



4.1. Determination of unit-cell lengths: find lengths 

Unit-cell determination begins with the user picking 100-1000 
spots from several diffraction patterns of various tilts. The user 
should attempt to choose a variety of spots so as to minimize the 
accidental selection of multiple appearances of a low-angle reflection 




Figure 1 

The Cataspot GUI. An electron diffraction pattern with the Cataspot GUI 
operating on the boxed region. Several user-selected spots including a centering 
spot (red plus), calculated beam center (red circled plus), beam stop center (yellow 
plus) and user chosen spot (green cross) are shown. 
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over multiple patterns. A vector v is calculated for each spot, which 
defines its position relative to Cartesian coordinates (0, 0, 0): 

\={x,y,z), (1) 



X = — 



y = cosO (yi-yj, 
z = sin 6* (yi-yj-a, 



(2) 
(3) 
(4) 



where Xi and yi are the x and y coordinates on the diffraction pattern 
image, Xc and yc are the x and y coordinates of the beam center, 0 is 
the tilt angle, and a is a correction for the Ewald sphere curvature. 
Although Ewald sphere curvature also affects the x and y coordi- 
nates, it was determined that this difference was so small ('^l pixel at 
2.0 A resolution) that it can be ignored. The effect of Ewald sphere 
curvature on the calculation of the z component of v is significant 
(^12 pixels at 2.0 A for our data set using our camera and microscope 
combination), so a is calculated as 



a = —b 



1 - 



b = (l/k)c, 



2\V2 



(5) 



(6) 



where k is the electron wavelength, x and y are as calculated above, 
and c is the angstrom to pixels conversion factor. 



Unit Cell Predictions 




250 



Figure 2 

Unit-cell predictions for lysozyme by 1 Jindjengths.py. Predictions were made with 
approximately 900 spots chosen from 20 images over a —20 to 20° tilt. Peaks for the 
correct a and b (55 pixels) and c (112 pixels) unit-cell lengths are denoted with 
arrows. Peaks for a 2b (65 pixels), 2(a 2b) (130 pixels), a ^ 2c (156 pixels) 
and a 3b (173 pixels) are also apparent (marked with stars). 



After V is calculated for each spot, the distance between spots d is 
calculated for every pair of spots. For spots defined by v« and v^, 

rf=IK-v,||. (7) 

The unit-cell lengths can be estimated from the distribution of d for 
all spots. For two spots with adjacent Miller indices, d is equal to a 
unit-cell dimension. This distance cannot be smaller than the smallest 
unit-cell dimension, so the smallest peaks in the distribution of d 
represent the unit-cell dimensions. This process is not exact and still 
requires some user intuition. For example <i(ooo)(ioo) (between Miller 
indices 100 and 000) equals the a unit-cell dimension, but <i(iio)(ooo) 
might be smaller than <i(ooo)(ooi) depending on the unit-cell dimen- 
sions. This, along with the possibility of multiple unit-cell dimensions 
having the same length, or one or more unit-cell lengths being close 
multiples of each other, means the user cannot simply pick the three 
shortest values of d as the unit-cell dimensions. The general formula 
for these cross-unit-cell vectors for a unit cell a, b, c with angles a, 13, y 
is 



■^(nnn)(n+m,n+p,n) 



[(ma + b cos y)^ + (pb sin y)^]^^^ , 



(8) 



where m and p are integers. This allows for the calculation of the 
expected peaks in the distribution of d, which can be compared with 
the observed distribution and used to verify that the correct unit-cell 
dimensions have been chosen (Fig. 2). Observations of diffraction 
patterns that hit on or near major planes of the crystal also allows 
rough measurements of the unit-cell dimensions and angles directly 
from the patterns, which can be used to verify these findings. 



4.2. Initial determination of unit-cell orientation: calc ucvectors 

determined, the 
This is initially 



Once rough unit-cell dimensions have been 
orientation of the unit cell can be established, 
accomplished by using the spots that were chosen by the user. First 
the vectors d between all of the chosen spots are calculated: 



da,z,=v, -Vft. (9) 

All of the vectors are compared with the three unit-cell dimensions 
and those within a user-specified threshold are kept. The remaining 
vectors are then compared with four reference vectors with Cartesian 
coordinates (1, 0, 0), (0, 1, 0), (0, 0, 1) and (1, 1, 0). The angle (a) 
between the each vector and the reference vector is calculated by 



d r 



(10) 



where d is the difference vector and r is the reference vector. This 
allows the vectors to be divided into roughly parallel groups based on 
the angles between the vector and the four reference vectors. 

The orientations of the vectors in each group are determined by 
calculating the cross product of the vector and the (1, 0, 0) reference 
vector, and appropriate vectors are flipped, by multiplying by —1, so 
all vectors in each group are oriented in the same direction. The 
vectors in each group are averaged to produce a list of candidates for 
the unit-cell vectors. Each candidate is assigned a score based on the 
number of vectors that contributed to it. By examining the angles 
between the candidates and their scores, the correct unit-cell vectors 
can usually be chosen. 

4.3. Refinement of the vectors: spot index, refine spots, 
UCR index and recalculate vectors 

Once the three vectors defining the unit cell have been chosen, 
they are used to predict spots on each image. The unit-cell vectors 
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{a^, a , a^), {b^, b ,b^) and (c^, c , c^) are used to create a unit-cell 



matrix 




(11) 



Two reference spots are chosen from each image. These spots are 
chosen because they have strong intensity and are thought to 
represent complete intensities where the Ewald sphere passed 
directly through the center of the spot. The x, y, z coordinates of the 
reference spots are calculated as above and their Miller indices 
determined by multiplying the x, y, z coordinates with the inverse 
unit-cell matrix: 



(12) 



These values are rounded to the nearest integer, and a 'check vector' 
(q) normal to the plane containing the two reference points is 
calculated as 




q = {d, ej) = (/zi, /ci, l^) x (/z2, /c2, k)- 



(13) 



Every Miller index is then compared with the check vector. For any 
given Miller index hkl, the dot product of the Miller index and q can 
be used to determine if that Miller index lies on the same plane as the 
two reference spots. 

The two reference points are known to exist because they are 
visible on the diffraction pattern, so this can be used to predict the 
other spots that should appear on each diffraction pattern. The 
quality of the reference points chosen is critically important. The 
spots must be the user's best estimation of Bragg peaks that were 
perfectly bisected by the Ewald sphere. A good rule of thumb is to 
choose spots that are of high relative intensity and have adjacent 




ib) 




ic) 



Figure 3 

A demonstration of how the quality of reference points affects the accuracy of spot 
prediction, (a) A lysozyme diffraction pattern indexed with poor-quality reference 
points, (b) A zoomed-in view of the region of panel (a) bounded by the dashed line, 
(c) The same diffraction pattern indexed with higher-quahty reference points, (d) A 
zoomed-in view of the region bounded by the dashed line in (c). 



Spots visible on both sides. Fig. 3 illustrates a diffraction pattern 
indexed with two different sets of reference points, demonstrating the 
effects of the reference set on the overall quality of the indexing. 

The probability that the dot product of any given Miller index and 
the check plane is exactly zero is very small. The calculation of the 
check plane is based on the locations of the reference spots, which 
introduces error as the measurement of these x, y coordinates will 
never be exactly perfect. To cope with this noise, the spots are instead 
compared with L; a 'Laue zone threshold', so named because its 
functional effect is to determine the widths of Laue zones in the spot 
predictions. Modifying the L value allows for compensation for 
inaccuracy in the tilt angle by expanding the size of the predicted 
Laue zones (Fig. 4). Raising this threshold results in the prediction of 
more spots but also increases the number of partial intensities 
recorded and 'false positives', indexing where no spot is actually 
observed. 

After a list of spots has been created for each image, their x, y, z 
coordinates are calculated by 



(14) 



The X, y, z coordinates are then used to calculate the coordinates of 
the spot in two dimensions (x', 




- X, y' = y/ cos d 



(15) 



These coordinates are used to draw circles around the predicted spots 
on the diffraction patterns for visual inspection. 

The initial spot predictions are dependent on the accuracy of the 
unit-cell vectors, which were determined from a limited set of points 
picked by the user. A second iteration of the vector finding process 
allows the refinement of the vectors for more accurate spot predic- 
tion. 



(«) 




id) 




Figure 4 

The effects of changing the Laue zone threshold on spot prediction. Predicted spots 
with 15% {a) and {b) and 30% (c) and {d) Laue zone thresholds drawn on a 
lysozyme diffraction pattern. 
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The predicted spots are first refined by mass centering. A square 
box is drawn around each spot and the pixel values put in a matrix. 
Each row and column of the matrix is summed and the maximum 
pixel values of the rows and columns used to determine the actual 
center of mass for the spot. The box is moved to this center and the 
mass centering process repeated. If the second round of mass 
centering produces a large movement (more than one or two pixels in 
any given direction) the spot is discarded. This is to prevent the spot 
prediction from 'walking' between Miller indices. 

After mass centering, the intensity of the spot is compared with the 
background intensity. A square and a circle where the circle diameter 
is equal to the square edge length are drawn, centered on the spot. 
The background intensity is defined as the mean pixel intensity of the 
area bounded by the square but outside the circle. The mean intensity 
of the area inside the circle is compared with the background 
intensity. Any spot with a low spot-to-background ratio is discarded. 
Because only intense spots that are cleanly bisected by the Ewald 
sphere are desired for unit-cell determination, this threshold is set 
high, usually around 10%. 

The fist of refined spots is then used to recalculate the unit-cell 
vectors. Because this list contains more spots and their locations are 
more accurate, the recalculated vectors produce better spot predic- 
tion and indexing. This process can be repeated iteratively until the 
unit-cell vectors are stable and accurate. 



5. Indexing and intensity measurement: measure intensities 

Once satisfactory unit-cell vectors have been obtained, the diffraction 
pattern image is indexed for a final time. The last set of spot indices is 
not mass centered. At this point the indexing should be accurate 
enough to capture all of the spots, and mass centering raises the risk 
of a spot 'walking' to an adjacent Miller index, which would lead to 
the intensity being attributed to the wrong reflection. 

When the final indexing is complete the intensity of each spot is 
measured. The mean background is calculated for each spot as above 
and subtracted from each pixel within the circle, and the sum of the 
background-subtracted pixel values is recorded for that Miller index. 
The same mean spot intensity to background intensity comparison is 
then made as before, but a much lower threshold, usually ~0.5%, is 
used to capture weak spots. 



6. Merging: p422 merge maxonly and p422 merge thresh 

After all of the images have been indexed and the intensities 
extracted, intensity measurements from symmetry-related Miller 
indices must be merged. The symmetry relations of the different 
Miller indices are determined by the specific space group of the 
crystal. The proof of concept work took advantage of the a priori 
knowledge of the crystal space group. Without this information the 
space group must be determined by examining the unit-cell dimen- 
sions, angles and systematic absences using a tool such as POINT- 
LESS (Evans, 2006). Merging programs were written specifically for 
/?422 symmetries; merging data from other symmetries would require 
modification of the program. 

Because our data originated from a static crystal (still shots), the 
probability of collecting partial reflections became much higher (Shi 
et aL, 2013). This led to inaccurate intensity measurements unless the 
partial reflections were scaled or excluded. To cope with this issue in 
our original work a strict cutoff was imposed. The program 
p422_merge_maxonly merges the data based on p422 symmetry. For 
any given reflection the largest recorded intensity was assumed to 



closely represent the complete reflection. Any measurements for that 
Miller index with smaller intensities were discarded. This is a crude 
method which precludes the calculation of i^merge- Another program, 
p422_merge_thresh, allows the user to specify an i^merge cutoff: only 
spots within a specified range of the maximum recorded intensity for 
that Miller index are used for merging. Fig. 5 illustrates the effects of 
imposed cutoffs on the final Rmerge and Pearson correlation coeffi- 
cient for a lysozyme X-ray diffraction data set collected in-house. The 
full merged data set had an i^merge of 0-32 and 0.55 correlation to the 
X-ray data set. Overall the strict 1.0 cutoff (i.e. maximum measure- 
ments only) improved the cross correlation by approximately 10%, 
although most of this improvement was realized using a more 
permissive 0.1 cutoff. 

The final output of the merging programs is a text file containing 
the Miller index, intensity, structure factor, Sigl and SigF for each 
reflection. For intensity measurements originating from a single 
observation, Sigl and SigF values cannot be calculated and they are 
instead estimated as the square root of the intensity and the square 
root of the structure factor, respectively. The output of the merging 
program can then be fed into the program COMB A T from the CCP4 
suite (Winn et aL, 2011) to generate an mtz file, which can be used for 
downstream applications. 



7. Discussion 

This MicroED suite represents a refinement of an ad hoc software 
solution initially written for the determination of the structure of 
lysozyme by MicroED (Shi et aL, 2013). The programs were initially 
written in response to problems processing the data using currently 
available software and contain many workarounds resulting from 
logistical limitations that were described before as well as here. 
Although the programs have been modified for general use and now 
include a more user-friendly GUI they are not intended to be a 
mature suite for data processing. The final goal of this project is the 
integration of the MicroED techniques into currently available 
crystallography software. This should be concurrent with methodo- 
logical improvements in MicroED. 




0.6 

RiiMf|8 Cutoff 

Figure 5 

The effects of imposed Rmerge cutoffs. The results of merging a lysozyme data set 
containing 36 823 intensity measurements with varying i^merge cutoffs, showing 



of the merged data set and its Pearson cross correlation to an X-ray 



diffraction data set collected from the same batch of crystals. In all cases the final 
merged data set contained 5460 intensity measurements. 
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8. Software availability 

All of the programs in the MicroED suite are available at http:// 
www.github.com/gonenlab/2013UED.git. 

The authors would like to thank Don Olbris (JFRC) for the 
development of the Cataspot GUI, and Dan Shi and Brent Nannenga 
(JFRC) for many helpful discussions and critical reading of the 
manuscript. The Gonen laboratory is supported by the Howard 
Hughes Medical Institute. 
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